Set resource limit for addon containers #10653

Merged
4 commits merged into kubernetes:master on Jul 2, 2015

Conversation

dchen1107
Member

Measurements are based on the current default configuration for GCE / GKE (?) and AWS: 4 nodes, 30~50 pods per node.

For all containers, I allocate 100m of CPU even though they all use less than that, since it matches today's default setting from the LimitRange of the default namespace (see the sketch after this list). Memory is more complicated here:

  1. All containers in the DNS service are very stable and use a tiny bit of memory (< 5Mi), but I rounded them up to 50Mi anyway.
  2. It looks like fluentd has a memory leak over time, but since it is stateless, it is safe to be OOM-killed for now. Because it is a static pod and it is too hard to change static pods' configuration today, I decided to do nothing about it.
  3. We observed memory leaks in both influxdb and heapster. Without any new pods scheduled to the cluster, over one night the memory usage of influxdb shot up from 150Mi to 2.5Gi, and that of heapster from 150Mi to 2Gi. I chose 200Mi for both for now. Influxdb now has an emptyDir volume attached to the pod, so its data can be recovered.
  4. We also observed memory leaks in elasticsearch and kibana. Since elasticsearch and kibana are not started by default on GCE and GKE, I decided not to set memory limits for them for now.
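
For reference, a minimal sketch of the kind of LimitRange mentioned above, assuming an object in the default namespace that defaults container CPU to 100m; the exact object shipped with a cluster may differ:

```yaml
# Illustrative sketch only (assumed name and values): a LimitRange that
# defaults container CPU to 100m for pods created in the default namespace.
apiVersion: v1
kind: LimitRange
metadata:
  name: limits
  namespace: default
spec:
  limits:
  - type: Container
    default:
      cpu: 100m
```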

cc/ @bgrant0607 @thockin @saad-ali @a-robinson

@eparis and @justinsb too for the Vagrant case.

@bgrant0607
Member

How often do we expect heapster and influxdb to restart, then? And these numbers are for how many pods? Do we have scaling guidance for users?

If we want to ensure that all addons have limits set, we'll need a check, but that can be done later -- file an issue for that. That would also mean that elasticsearch and kibana would need to get memory limits.

Otherwise, LGTM

@bgrant0607
Member

#10077 was merged. The UI will need limits, presumably.

@bgrant0607 bgrant0607 added this to the v1.0 milestone Jul 1, 2015
@bgrant0607 bgrant0607 assigned thockin and unassigned bgrant0607 Jul 1, 2015
@k8s-bot

k8s-bot commented Jul 1, 2015

GCE e2e build/test passed for commit 6b61918.

@a-robinson
Contributor

If we're going to run fluentd on the master (#10597), it should definitely have resource limits. As you mention, it should be safe to be oom-killed.

@dchen1107
Member Author

@bgrant0607 The problem I have today is that neither heapster nor influxdb has any limit, so they will only be OOM-killed when the system eventually OOMs; no, I don't have that restart frequency number yet. I expect to get this PR in so that I can collect that number.

The numbers I picked here are based on 4 nodes and 100 containers per node (30 pods with 2 containers each). We can provide scaling guidance for users based on that.

@dchen1107
Member Author

Can we fix the UI in a separate PR? I really want to start collecting heapster and influxdb stats as early as possible here.

@a-robinson fluentd has a slight memory leak over time. I will send another PR to resolve the fluentd issue soon.

@a-robinson
Contributor

Ack, thanks. Its memory leak would seem to make it even more important to put a cap on its memory usage.

@dchen1107
Member Author

@a-robinson I pushed an extra commit to update the fluentd manifest file too. For the UI, we have to wait for measurements, hence the separate PR. :-)

@@ -9,6 +9,7 @@ spec:
     resources:
       limits:
         cpu: 100m
+        memory: 200Mi
Contributor

Isn't 100Mi more than enough given that in all of Saad's tests it never used that much?

Member Author

No, Saad's report in that issue covers only a couple of hours. The limit I chose here is based on over 2 days. Both of us are running soak tests against our default configurations.

Member

To be clear, my report was over the course of 24 hours. It looks like the fluentd container with the elasticsearch plugin has a leak, because memory usage continues to grow. After two days it hit 151 MB. See #10335 (comment)

Member

Why is the memory limit set for fluentd-gcp but not fluentd-es?

@a-robinson
Contributor

Sorry for all the comments, but you should also add limits to cluster/saltbase/salt/fluentd-es/fluentd-es.yaml while you're adding them to fluentd-gcp.
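
For illustration only: a minimal sketch of what such a resources stanza in fluentd-es.yaml could look like, assuming the same values as the fluentd-gcp change above; note that, as discussed below, this PR left fluentd-es without limits.

```yaml
# Hypothetical fragment only -- values are assumed, and this change was not
# made in this PR. Shows where resource limits would go in the fluentd-es
# container spec.
spec:
  containers:
  - name: fluentd-es
    resources:
      limits:
        cpu: 100m
        memory: 200Mi
```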

@thockin
Member

thockin commented Jul 2, 2015

LGTM

@thockin thockin added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 2, 2015
@k8s-bot

k8s-bot commented Jul 2, 2015

GCE e2e build/test passed for commit 54531d9.

@dchen1107
Member Author

@a-robinson As mentioned above, cluster/saltbase/salt/fluentd-es/fluentd-es.yaml is not used by GKE and GCE, so I would rather not choose limits for it. My concern is reducing schedulability for those small clusters.

@dchen1107
Member Author

Since I have two TLs' LGTMs now, I marked it ok-to-merge. The reason I am rushing this in is that we are missing important data for heapster and influxdb, for example restart frequency.
I will monitor Jenkins tonight; if it causes any trouble, feel free to revert it later. Thanks!

@dchen1107
Member Author

@zmerlynn or @nikhiljindal Can I get a merge for this one? I am going to monitor Jenkins tonight and tomorrow. Thanks!

@bgrant0607
Member

LGTM

@zmerlynn
Member

zmerlynn commented Jul 2, 2015

I'm watching tonight as well.

zmerlynn added a commit that referenced this pull request Jul 2, 2015
Set resource limit for addon containers
@zmerlynn zmerlynn merged commit 1d16be6 into kubernetes:master Jul 2, 2015
@dchen1107
Member Author

Thanks, everyone! You guys are super!

@dchen1107
Member Author

#10335

@dchen1107
Member Author

#10760 for the memory leaks in heapster and influxdb

Labels
lgtm "Looks good to me", indicates that a PR is ready to be merged.